
BMC Bioinformatics

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match BMC Bioinformatics's content profile, based on 383 papers previously published here. The average preprint has a 0.37% match score for this journal, so anything above that is already an above-average fit.

1
geneslator: an R package for comprehensive gene identifier conversion and annotation

Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.

2026-04-01 bioinformatics 10.64898/2026.03.30.714723 medRxiv
Top 0.1%
28.9%

Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it

2
MOAflow: how re-design a pipeline with Nextflow streamlines data analysis

Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.

2026-03-30 bioinformatics 10.64898/2026.03.26.713914 medRxiv
Top 0.1%
28.7%

Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an already existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. The data of the original article was used to benchmark the new pipeline. Its outputs closely match those of the original study with minor variations. Conclusions: MOAflow demonstrates how the adoption of robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands.

3
From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference

Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.

2026-03-24 bioinformatics 10.64898/2026.03.21.713397 medRxiv
Top 0.1%
23.7%

Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.

4
BCAR: A fast and general barcode-sequence mapper for correcting sequencing errors

Andrews, B.; Ranganathan, R.

2026-03-31 bioinformatics 10.64898/2026.03.27.714882 medRxiv
Top 0.1%
23.5%

Motivation: DNA barcodes are commonly used as a tool to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, utilizing barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies to do this generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that developing an aligner purpose-built for error correction could yield higher quality barcode-sequence maps. Results: Here, we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation and test data are available from: https://github.com/dry-brews/BCAR

5
VaLPAS: Leveraging variation in experimental multi-omics data to elucidate protein function

Mahlich, Y.; Ross, D. H.; Monteiro, L.; McDermott, J. E.

2026-03-30 bioinformatics 10.64898/2026.03.26.712966 medRxiv
Top 0.3%
18.8%

Motivation: Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein, and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass-spectrometry. Results: The VaLPAS (Variation-Leveraged Phenomic Association Screen) framework is available as a Python package and provides a user-friendly platform for calculation of associations between expression patterns of genes or proteins in multi-omic datasets based on various statistical and learning methods. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of molecules using guilt by association with molecules of known function. We present results demonstrating the utility of VaLPAS to identify high-confidence predictions for a subset of genes/proteins of unknown function in a previously published multi-omics dataset from the oleaginous yeast, Rhodotorula toruloides. Availability: VaLPAS is written in Python. The code is hosted on GitHub (https://github.com/PNNL-Predictive-Phenomics/valpas/).

6
CCIDeconv: Hierarchical model for deconvolution of subcellular cell-cell interactions in single-cell data

Jayakumar, R.; Panwar, P.; Yang, J. Y. H.; Ghazanfar, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714643 medRxiv
Top 0.3%
18.3%

Motivation: Cell-cell interaction (CCI) underlies several fundamental mechanisms including development, homeostasis and disease progression. CCI are known to be localised to specific subcellular regions, for example, within the cytoplasms of cells. With the emergence of subcellular spatial transcriptomics technologies (sST), there is an opportunity to attribute CCI to subcellular regions. We aimed to deconvolute CCI to subcellular CCI (sCCI) in non-spatial single-cell transcriptomics (i.e. scRNA-seq) datasets using a modified CCI score from CellChat. Results: By calculating the sCCI score specific to cytoplasm and nucleus in nine publicly available sST datasets, we identified unique nucleus-nucleus and cytoplasm-cytoplasm sCCI. Then, we deconvolved the communication score to subcellular regions by using a hierarchical classification and regression model, which we name CCIDeconv. We performed leave-one-dataset-out cross-validation across nine datasets over a range of different tissue types from human samples. We observed that training across many different tissue types resulted in robust deconvolution performance in an unseen dataset. As the number of training datasets increased, models trained without spatial features achieved similar performance to models including spatial features. This implied the potential for accurate prediction of sCCI events even from scRNA-seq data, given large numbers of training datasets. Overall, we offer a method towards attributing CCI events to subcellular regions. This method can help researchers dissect sCCI patterns and gain insights into the underlying biology in a range of tissues covering health and disease.

7
Machine Learning-Enhanced Nanopore ITS Analysis: Evaluating CPU-GPU Pipelines for High-Accuracy Fungal Taxonomic Resolution

Albuja, D. S.; Maldonado, P. S.; Zambrano, P. E.; Olmos, J. R.; Vera, E. R.

2026-04-07 bioinformatics 10.64898/2026.04.06.716835 medRxiv
Top 0.4%
17.0%

Accurate fungal species identification is critical for microbial ecology, food safety, and plant pathology. However, morphological limitations and genomic complexity hinder this process. Molecular markers such as the ITS region, along with Oxford Nanopore long-read sequencing, offer a robust solution, albeit limited by error rates in homopolymeric regions and a high dependence on advanced computational resources (GPUs) to achieve high accuracy. This study benchmarks two bioinformatics workflows on a multiplexed dataset of complex fungal communities to address this technological gap: a CPU-based workflow optimized using a Bayesian machine learning engine and a GPU-accelerated workflow incorporating "super high accuracy" (SUP) models and refinement with neural networks. The results establish a scalable framework for evaluating the impact of computational architecture on final taxonomic resolution. GPU processing is shown to maximize data retention and species-level accuracy by correcting systematic errors. Alternatively, implementing automated hyperparameter optimization in CPU environments stabilizes sequence clustering and achieves high taxonomic concordance at the genus level. This conceptual advance validates the feasibility of performing ITS metabarcoding analysis in resource-constrained infrastructures, thus providing the scientific community with a reproducible protocol that balances the need for taxonomic precision with hardware availability.

8
CLEAR: Concise List Enrichment Analysis Reducing Redundancy

Jia, X.; Phan, A.; Dorman, K.; Kadelka, C.

2026-04-01 bioinformatics 10.64898/2026.03.30.715378 medRxiv
Top 0.4%
16.5%

Motivation: High-throughput experiments generate genome-wide measurements for thousands of genes, which are often tested marginally. Biological processes are driven by coordinated groups of genes rather than individual genes, making gene set enrichment analysis an essential post hoc interpretation tool. Traditional approaches such as Over-Representation Analysis and Gene Set Enrichment Analysis test gene sets independently, which ignores the hierarchical and overlapping structure of gene set collections such as the Gene Ontology, and often leads to redundant enrichment results. Set-based approaches such as MGSA address this issue by modeling multiple gene sets simultaneously, but they rely on binary gene activation states derived from arbitrary thresholds on gene-level statistics. Results: We introduce Concise List Enrichment Analysis Reducing Redundancy (CLEAR), a Bayesian gene set enrichment framework that jointly models gene sets while incorporating continuous gene-level statistics such as test statistics or p-values. CLEAR extends model-based gene set analysis by replacing threshold-based gene activation with a probabilistic model for continuous gene-level statistics. This approach preserves the redundancy-reduction advantages of set-based enrichment methods while avoiding the information loss introduced by binarization. Using both simulated datasets and human gene expression data, we show that CLEAR improves sensitivity compared with existing enrichment approaches while producing a more concise and interpretable set of enriched gene sets. Availability and implementation: The source code, data, and a brief tutorial are freely available at https://github.com/jiatuya/CLEAR

9
A graph-based learning approach to predict the effects of gene perturbations on molecular phenotypes

Jin, Y.; Sverchkov, Y.; Sushkova, A.; Ohtake, M.; Emfinger, C.; Craven, M.

2026-03-23 systems biology 10.64898/2026.03.20.712202 medRxiv
Top 0.4%
15.0%

Motivation: Large-scale gene knockdown/knockout screens have been used to gain insight into a wide array of phenotypes and biological processes. However, conducting such experiments is expensive and labor-intensive. In this work, we present a general graph-based machine-learning approach that can predict the effects of gene perturbations on molecular phenotypes of interest given some measured phenotypic effects of other gene perturbations. The motivation for learning models that can predict the effects of gene perturbations is fourfold. Such models can (1) predict effects for unmeasured genes in cases in which cost or technical barriers preclude perturbing every gene, (2) prioritize unmeasured genes or sets of genes for subsequent perturbation experiments, (3) hypothesize mechanisms that underlie the relationships between the perturbed genes and their effects, and (4) generalize to other unmeasured phenotypes of interest. Results: We evaluate our approach by applying it, in conjunction with four different learning methods, to learn models for four varied phenotypes. Our empirical evaluation demonstrates that the learned models (1) show relatively high levels of predictive accuracy across the four phenotypes, (2) have better predictive accuracy than several standard baselines, (3) can often learn accurate models with small training sets, (4) benefit from having multiple sources of evidence in the input representation, and (5) can, in many cases, transfer their predictive value to other phenotypes. Availability and Implementation: The assembled datasets and source code for this work are available at: https://github.com/Craven-Biostat-Lab/graph-molecular-phenotype-prediction

10
Homology-based perspective on pangenome graphs

Lisiecka, A.; Kowalewska, A.; Dojer, N.

2026-03-18 bioinformatics 10.64898/2026.03.16.712038 medRxiv
Top 0.5%
14.5%

Pangenome graphs conveniently represent genetic variation within a population. Several types of such graphs have been proposed, with varying properties and potential applications. Among them, variation graphs (VGs) seem best suited to replace reference genomes in sequencing data processing, while whole genome alignments (WGAs) are particularly practical for comparative genomics applications. For both models, no widely accepted optimization criteria for a graph representing a given set of genomes have been proposed. In the current paper we introduce the concept of a homology relation induced by a pangenome graph on the characters of the represented genomic sequences and define such relations for both the VG and WGA models. Then, we use this concept to propose homology-based metrics for comparing different graphs representing the same genome collection, and to formulate the desired properties of transformations between VG and WGA models. Moreover, we propose several such transformations and examine their properties on pangenome graph data. Finally, we provide implementations of these transformations in the package WGAtools, available at https://github.com/anialisiecka/WGAtools.

11
A Permutation-Based Framework for Evaluating Bias in Microbiome Differential Abundance Analysis

Zeng, K.; Fodor, A. A.

2026-03-18 bioinformatics 10.64898/2026.03.14.711836 medRxiv
Top 0.5%
14.1%

Background: In microbiome research, differential abundance analysis aids in identifying significant differences in microbial taxa across two or more conditions. Statistical approaches used for this purpose include classical tests such as the t-test and Wilcoxon test, as well as methods designed to account for the compositional nature of microbiome data, including ALDEx2, ANCOM-BC2, and metagenomeSeq. In addition, methods originally developed for RNA sequencing data, such as DESeq2 and edgeR, have been frequently applied to microbiome studies. However, the use of these methods has been controversial. One area of concern is whether different modeling frameworks produce accurate p-values when the null hypothesis is true. Results: We evaluated eight methods across six publicly available datasets. Four permutation strategies were applied to generate data under the null hypothesis: shuffling sample names, shuffling counts within samples, shuffling counts within taxa, and fully randomizing the counts table. Methods based on the negative binomial distribution (DESeq2 and edgeR) produced p-values that were consistently smaller than expected under the null hypothesis. In contrast, methods that attempt to correct for compositionality (ALDEx2, ANCOM-BC2, and metagenomeSeq) tended to produce larger-than-expected p-values, even when only sample labels were shuffled, a permutation strategy that does not alter compositional structure. These deviations were dependent on dataset characteristics and permutation strategy, suggesting complex interactions between underlying data structure and algorithm performance. Generating data to follow the expected negative binomial distribution did not eliminate the tendency of DESeq2 and edgeR to exaggerate statistical significance. Although similar patterns were observed in RNA sequencing (RNAseq) datasets, the deviations were less pronounced than in microbiome data.
In contrast, the classical t-test and Wilcoxon test yielded p-value distributions consistent with theoretical expectations across datasets and permutation strategies. Conclusions: These results indicate that the performance of several widely used differential abundance methods can be problematic under null conditions and may affect biological interpretation. Our findings emphasize the importance of careful method selection and highlight the robustness of simpler statistical approaches for reliable inference.
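
The four permutation strategies named in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the table layout (rows are taxa, columns are samples), the function names, and the use of Python's standard `random` module are all assumptions made for illustration.

```python
import random

# Assumed layout: counts[t][s] is the count of taxon t in sample s.

def shuffle_sample_labels(labels, rng):
    """Permute group labels across samples; the counts table is untouched."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def shuffle_within_taxa(counts, rng):
    """Permute each taxon's counts across samples (row-wise shuffle)."""
    return [rng.sample(list(row), len(row)) for row in counts]

def shuffle_within_samples(counts, rng):
    """Permute each sample's counts across taxa (column-wise shuffle)."""
    columns = [rng.sample(list(col), len(col)) for col in zip(*counts)]
    # Transpose back so rows are taxa again.
    return [list(row) for row in zip(*columns)]

def shuffle_full_table(counts, rng):
    """Pool every count in the table, shuffle, and refill the same shape."""
    n_cols = len(counts[0])
    flat = [value for row in counts for value in row]
    rng.shuffle(flat)
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
```

Only the first strategy leaves every sample's composition intact, which is why the abstract singles it out as the permutation that does not alter compositional structure.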

12
Ryder: Epigenome normalization using a two-tier model and internal reference regions

Cao, Y.; Ge, G.; Zhao, K.

2026-03-18 bioinformatics 10.64898/2026.03.15.711886 medRxiv
Top 0.5%
14.1%

Motivation: Sequencing-based epigenomic profiling methods are powerful but suffer from technical variability that complicates cross-sample comparisons and can obscure true biological signals. While existing normalization methods using spike-in controls or computational approaches have been proposed, they often rely on assumptions that may not hold across diverse experimental conditions or require additional data types. Results: We present Ryder, a flexible and robust Python package for the normalization and differential analysis of epigenomic data. Ryder introduces a normalization strategy that leverages stable internal reference regions, such as invariant CTCF binding sites, to correct for technical artifacts genome-wide. Our results show that it effectively models and adjusts both background noise and signal intensity, ensuring accurate signal alignment across samples. We demonstrate that Ryder performs robust, genome-wide normalization (correcting signals in both peak and background regions) across a range of assays including DNase-seq, CUT&RUN, ATAC-seq, MNase-seq, and ChIP-seq, with or without spike-in controls. By reducing technical noise, we show that Ryder improves the detection of genuine biological changes, such as quantitative reduction of chromatin accessibility at key enhancer elements by depletion of BRG1, a key subunit of the chromatin remodeling BAF complexes. Availability and Implementation: The Ryder source code and documentation are freely available at: https://github.com/YaqiangCao/ryder.

13
Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity

Parmigiani, L.; Peterlongo, P.

2026-03-18 bioinformatics 10.64898/2026.03.16.711983 medRxiv
Top 0.6%
13.0%

A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
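
As a rough illustration of the diversity index mentioned above, the sketch below computes a Hill number of order q from a list of abundances. It shows only the standard ecological definition. How the authors derive abundances from graph nodes is not stated in the abstract, so the `abundances` input and the function name are hypothetical.

```python
import math

def hill_number(abundances, q):
    """Hill number of order q for a list of non-negative abundances.

    q = 0 counts every item equally (richness); larger q gives
    progressively less weight to rare items.
    """
    total = sum(abundances)
    props = [a / total for a in abundances if a > 0]
    if q == 1:
        # The general formula is undefined at q = 1; its limit is the
        # exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))
```

For perfectly even abundances every order returns the number of items, while a skewed input such as `[9, 1]` gives 2 at q = 0 and a value close to 1.22 at q = 2, which shows how higher orders discount rare items.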

14
REBEL, Reproducible Environment Builder for Explicit Library resolution

Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.

2026-04-07 bioinformatics 10.64898/2026.04.04.716498 medRxiv
Top 0.6%
12.8%

Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege and not a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers using REBEL versus the standard package manager; REBEL resolved 149 of 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core

15
UQ-PhysiCell: An extensible Python framework for uncertainty quantification and model analysis in PhysiCell

L. Rocha, H.; Bucher, E.; Zhang, S.; Deshpande, A.; Bergman, D. R.; Heiland, R.; Macklin, P. R.

2026-04-08 systems biology 10.64898/2026.04.06.716692 medRxiv
Top 0.6%
12.8%

Agent-based models (ABMs) are widely used to study complex multiscale biological systems, particularly in cancer research. However, their high-dimensional parameter spaces, stochasticity, and computational costs pose significant challenges for uncertainty quantification, calibration, and systematic comparison of competing mechanistic hypotheses. PhysiCell has evolved into a growing ecosystem of open-source tools supporting physics-based multicellular modeling, including model construction, visualization, and data integration. However, despite these advances, systematic support for uncertainty-aware model analysis, scalable parameter exploration, and formal calibration workflows remains limited. Here, we introduce UQ-PhysiCell, an open-source Python package that enables uncertainty quantification, calibration, and model selection for PhysiCell models using a modular and scalable workflow. UQ-PhysiCell acts as a manager of PhysiCell simulation inputs and outputs, including parameters, initial conditions, rules, and MultiCellDS-compliant objects, and provides automated orchestration of large ensembles of simulations. The framework supports multiple levels of parallelism to accelerate the analysis, including the parallel execution of independent simulations, stochastic replicates, and downstream analysis tasks. UQ-PhysiCell integrates seamlessly with established Python libraries for sensitivity analysis, optimization, Bayesian inference, and surrogate modeling, allowing users to construct customized pipelines that match their modeling goals and computational resource requirements. By decoupling model execution from statistical analysis and emphasizing extensibility and reproducibility, UQ-PhysiCell lowers the barrier to applying rigorous uncertainty-aware methodologies to agent-based modeling and supports the systematic evaluation of PhysiCell models in biological and biomedical research. 
Author summary: We developed UQ-PhysiCell to address a key challenge in agent-based modeling: the systematic quantification of uncertainty in complex stochastic simulations. PhysiCell is widely used to model multicellular biological systems, particularly in cancer research; however, practical tools for uncertainty analysis, calibration, and model comparison are often developed in an ad hoc manner. This makes the results difficult to reproduce and limits the ability to rigorously evaluate competing biological hypotheses. UQ-PhysiCell provides a flexible Python framework that manages the inputs and outputs of PhysiCell simulations and enables large-scale computational analysis. We designed the software to be modular, allowing users to build their own analysis pipelines and combine different methodologies for sensitivity analysis, calibration, and model selection. Rather than enforcing a single workflow, UQ-PhysiCell supports customization to match specific scientific questions and computational requirements. To make uncertainty-aware analyses feasible for computationally intensive agent-based models, UQ-PhysiCell implements multiple parallelism strategies, enabling the concurrent execution of simulations, stochastic replicates, and downstream analyses. By promoting reproducibility, scalability, and methodological flexibility, UQ-PhysiCell helps researchers move beyond single best-fit simulations toward more reliable and interpretable computational modeling.

16
Cellector: A tool to detect foreign genotype cells in scRNAseq data with applications in leukemia and microchimerism.

Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714571 medRxiv
Top 0.6%
12.7%

Rare, genetically distinct cells can occur in various samples, such as those from transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application of particular interest is that of leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patient's genotype and the cell type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patient's leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision making for these patients. Here we present Cellector, a computational method for identifying rare foreign genotype cells in single cell RNAseq (scRNAseq) datasets. We show that Cellector accurately detects microchimeric cells even when they make up an exceedingly low percentage of the cells present (0.05% or lower).

17
PanXpress: Gene expression quantification with a pan-transcriptomic gapped k-mer index

Alves Ferreira, I.; Zentgraf, J.; Schmitz, J. E.; Rahmann, S.

2026-03-20 bioinformatics 10.64898/2026.03.19.712873 medRxiv
Top 0.6%
12.5%

Motivation: Most existing workflows for quantifying bacterial gene expression from RNA-seq data rely on mapping reads to a (single) reference transcriptome, typically ignoring strain-level variation. When samples contain unknown or mixed strains, these workflows may introduce reference bias and fail to accurately capture strain-specific gene expression. Pan-transcriptomic approaches address this issue by using pan-transcriptomes as references, but existing solutions require multiple steps for pan-transcriptome construction, indexing, and expression quantification. Results: We introduce PanXpress, a unified framework for bacterial pantranscriptomics that performs pantranscriptome construction and indexing directly from genomic FASTA and GFF annotation files, alignment-free mapping of reads to genes from FASTQ samples, and gene expression quantification. The index, a multi-way Cuckoo hash table storing gapped k-mers with associated genes, preserves diversity on the k-mer level. Using simulated RNA-seq data from a mixture of Pseudomonas aeruginosa strains, PanXpress achieves mapping recall comparable to alignment-based methods such as Bowtie2 with higher precision and obtains accurate gene expression and log fold change estimates. On real P. aeruginosa RNA-seq data, using the PanXpress pantranscriptomic reference increases the proportion of mapped reads and the number of expressed genes discovered. The PanXpress index is smaller than that of other tools (Salmon, Kallisto, Bowtie2), and it provides faster analysis with consistent results. PanXpress is thus an accurate and efficient method for bacterial gene expression analysis in complex samples. Availability: PanXpress is available at https://gitlab.com/rahmannlab/panxpress. Contact: sven.rahmann@uni-saarland.de
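
To make the idea of a gapped k-mer index concrete, here is a small sketch under stated assumptions. The mask notation (`#` for positions that are kept, `_` for positions that are ignored), the function names, and the toy sequences are hypothetical, and a plain Python dictionary stands in for the multi-way Cuckoo hash table described in the abstract.

```python
from collections import defaultdict

def gapped_kmers(seq, mask):
    """Yield the gapped k-mer at every window of the sequence.

    The mask is a string of '#' (keep this position) and '_' (skip it);
    the window width is the mask length.
    """
    width = len(mask)
    keep = [i for i, symbol in enumerate(mask) if symbol == "#"]
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield "".join(window[i] for i in keep)

def build_index(genes, mask):
    """Map each gapped k-mer to the set of gene ids that contain it."""
    index = defaultdict(set)
    for gene_id, seq in genes.items():
        for kmer in gapped_kmers(seq, mask):
            index[kmer].add(gene_id)
    return index
```

Because skipped positions do not contribute to the key, two sequences that differ only at a masked position still map to the same entry. A real implementation would also need to handle reverse complements and ambiguous bases, which this sketch omits.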

18
Exploring transcriptomic and genomic latent variable correction approaches in differential expression analysis.

Appulingam, Y.; Jammal, J.; Ali, A.; Topp, S.; NYGC ALS Consortium; Iacoangeli, A.; Pain, O.

2026-04-08 bioinformatics 10.64898/2026.04.07.716914 medRxiv
Top 0.7%
10.8%

Background: Differential expression analysis is a central tool for studying the biological processes altered in human diseases via transcriptomic signatures. However, transcriptomic datasets are systematically confounded by latent variables from two distinct sources: unmeasured technical and biological heterogeneity within the expression data, and expression differences driven by population stratification. Correction using expression-based surrogate variables (SVs) and genotype-based principal components (PCs) addresses these sources independently, yet no study has directly evaluated their combined use against either method alone within a differential expression framework. We hypothesised that including both correction layers simultaneously would produce more biologically valid and reproducible results than either approach alone, and tested this in two independent RNA-seq datasets of amyotrophic lateral sclerosis (ALS) cases and controls with matching genotype data. Results: Four nested differential expression models (no correction, PC-only, SV-only, and combined SV+PC) were evaluated across the KCLBB (96 cases and 52 controls) and ALS Consortium (272 cases and 35 controls) datasets. Models were assessed on cross-dataset effect size concordance, cross-dataset replicability quantified by the Jaccard similarity index, and biological recall against a curated reference set of 66 known ALS genes. The combined SV+PC framework consistently outperformed the simpler models across all metrics. Replicability improved nearly ten-fold compared to the uncorrected model (Jaccard index: 2.28% to 19.5%), and the combined framework showed a statistically significant 2.1% gain over the SV-only model. Biological recall of known ALS genes doubled compared with SV correction alone. Crucially, effect size stability was preserved, with the combined model expanding the shared transcriptomic signal without sacrificing consistency. These findings remained generally robust to the number of PCs in sensitivity analyses. Conclusions: This study found that SVs and genotype PCs address non-redundant sources of confounding, and we recommend their combined use as standard practice in differential expression analysis where matched genotype data are available. Notably, PCs capturing population structure can also be derived directly from RNA-seq data, extending the applicability of this framework to studies lacking matched genotype data. Although this analysis was restricted to ALS datasets, we expect these findings to generalise to other traits.
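The replicability metric named in this abstract, the Jaccard similarity index, is the size of the intersection of two sets divided by the size of their union. The sketch below shows how it could be computed for two lists of significant genes. The function name and the gene lists are hypothetical and are not taken from the study.

```python
def jaccard_index(genes_a, genes_b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0.0 for two empty sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical significant-gene lists from two datasets (illustrative only).
    dataset_1 = {"GENE1", "GENE2", "GENE3"}
    dataset_2 = {"GENE2", "GENE3", "GENE4"}
    print(jaccard_index(dataset_1, dataset_2))  # 2 shared of 4 total -> 0.5
```

Because the denominator is the union, the index penalises genes found in only one dataset, which makes it a stricter replicability measure than the raw overlap count.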

19
A Statistical Method to Estimate the Population-Level Frequencies of Plasmodium falciparum Haplotypes with Pfhrp2/3 Deletions in the Presence of Mixed-Clone Infections

Kayanula, L.; Verma, K.; Kumar Bharti, P.; Schneider, K. A.

2026-04-06 genetics 10.64898/2026.04.01.715806 medRxiv
Top 0.8%
10.4%

Background: The World Health Organization (WHO) has raised concerns over increasing Pfhrp2/3 deletions, which undermine the sensitivity of Pfhrp2-based rapid diagnostic tests (RDTs). Close monitoring of the population and a change in diagnostic methods are recommended if the prevalence of parasites with Pfhrp2/3 deletions exceeds 5%. In high-transmission settings, accurate estimates are hampered by the frequent occurrence of mixed-clone infections (multiplicity of infection; MOI). Objective and Methods: If parasites with and without deletions are present in an infection, standard molecular assays cannot detect the presence of the former. To accurately estimate frequencies of haplotypes with Pfhrp2/3 deletions in the presence of mixed infections, a novel statistical model that combines genetic/molecular information from Pfhrp2/3 with that from neutral markers is introduced. Maximum-likelihood estimates (MLEs) are obtained for haplotype frequencies characterized by markers at Pfhrp2/3 loci and at neutral marker loci. The expectation-maximization algorithm is used to derive the MLEs. The adequacy of the method (precision and accuracy) is assessed by numerical simulations. Results: The method was applied to an active surveillance study conducted in a tribal community in Jagdalpur, India, which enrolled febrile community members (n = 432) between October and November 2021. Four markers each at Pfhrp2 and Pfhrp3 are combined with one marker each at Pfmsp1 (which encodes P. falciparum merozoite surface protein 1) and Pfmsp2. Data from 117 patients with both P. falciparum infection and genetic information for the molecular markers were analysed with the novel statistical method. Conclusion: Results indicate that this novel method has promising statistical properties (asymptotic and in finite samples) and can be readily applied to real-world situations. A stable implementation of the method in R is provided. This novel approach enables accurate estimation of Pfhrp2/3 deletion frequencies in complex P. falciparum infections, addressing a key limitation of current molecular surveillance methods.
Author summary: Plasmodium falciparum (Pf) causes the most severe form of human malaria, accounting for over 90% of cases. Rapid diagnostic tests (RDTs) have become a cornerstone of malaria control. These RDTs detect Pf-specific antigens in a drop of blood. HRP2/3 emerged as the best antigen for such tests because it is Pf-specific and expressed in abundance. However, some parasites lack the genes that code for HRP2/3 proteins. If the parasites in an infection have such gene deletions, RDT results can be false negative. The WHO considers the containment of such deletions a public health priority and recommends monitoring their prevalence. Detecting HRP deletions is challenging if parasites with and without deletions co-occur in infections, because standard molecular assays cannot detect deletions in this situation. To overcome this challenge, we introduce a novel statistical method to estimate the frequency distribution of parasite variants with deletions. The method combines information from neutral molecular markers and from HRP-related markers to correct for unobservable information. Here we provide a derivation of the statistical model and a stable implementation, and we test its statistical properties with synthetic and real data, thereby showing that our method is well suited to the underlying problem.
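A small worked example may help show why mixed-clone infections bias naive estimates. It rests on a simplifying assumption that is not the authors' model: each clone in an infection is drawn independently and carries the deletion with population frequency p. A deletion is then observable only if all m clones carry it, which happens with probability p^m. The function name and the MOI distribution are hypothetical.

```python
def observable_deletion_fraction(p, moi_weights):
    """Expected fraction of infections in which a deletion is observable.

    p: population frequency of the deletion haplotype (assumed).
    moi_weights: dict mapping MOI m -> weight of infections with m clones (assumed).
    Under independent sampling of clones, all m clones carry the deletion
    with probability p ** m.
    """
    total = sum(moi_weights.values())
    return sum(w * p ** m for m, w in moi_weights.items()) / total


if __name__ == "__main__":
    p = 0.2  # illustrative true deletion frequency
    # Illustrative MOI distribution: half single-clone, half two-clone infections.
    print(observable_deletion_fraction(p, {1: 0.5, 2: 0.5}))
```

With these illustrative numbers, the observable fraction is 0.5 × 0.2 + 0.5 × 0.04 = 0.12, well below the assumed true frequency of 0.2. A gap of this kind is what a model that accounts for MOI aims to correct.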

20
EMITS: expectation-maximization abundance estimation for fungal ITS communities from long-read sequencing

O'Brien, A.; Lagos, C.; Fernandez, K.; Ojeda, B.; Parada, P.

2026-04-02 bioinformatics 10.64898/2026.03.31.715662 medRxiv
Top 0.8%
10.4%

As long-read amplicon sequencing becomes routine for fungal metabarcoding, species-level abundance estimation from ITS amplicons remains limited by naive best-hit classification, which misattributes reads among closely related species sharing similar ITS sequences and fragments abundance across redundant database entries. Here we present EMITS, a Rust-based tool that applies expectation-maximization (EM) to iteratively resolve ambiguous read-to-reference mappings from minimap2 alignments against the UNITE database, producing probabilistic species-level abundance estimates. EMITS includes platform-specific presets for Oxford Nanopore and PacBio chemistries and performs taxonomic aggregation across UNITE accessions. We validated EMITS using three complementary approaches: controlled simulations with tunable alignment noise, an Oxford Nanopore mock community of 10 fungal species with known composition, and a synthetic community of 21 species derived from UNITE reference sequences. In simulations, EM reduced L1 error by 80-92% compared to naive counting under realistic noise conditions. On the ONT mock community, EM correctly resolved within-genus species assignments where naive counting misattributed reads (e.g., Trichophyton mentagrophytes vs. T. simii; Penicillium species) and consolidated abundance across redundant database accessions. On the synthetic community, EM reduced false positive abundance by 54% and improved overall accuracy by 13.4%. Together with ITSxRust [O'Brien et al., 2026] for upstream ITS extraction, EMITS provides a complete high-performance pipeline for long-read fungal amplicon profiling.
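The EM idea described here can be sketched in a few lines of Python. This is a generic mixture-model EM for ambiguous read assignments, not the EMITS implementation (which is written in Rust). The function name, the input layout, and the toy likelihoods are assumptions made for illustration. In the E-step, each read is split across its candidate references in proportion to current abundance times alignment likelihood. In the M-step, abundances are re-estimated from those fractional counts.

```python
def em_abundance(read_hits, n_iter=100, tol=1e-9):
    """Estimate relative abundances from ambiguous read-to-reference hits.

    read_hits: list of dicts, one per read, mapping reference name to an
    alignment likelihood (hypothetical input format).
    Returns a dict mapping reference name to estimated relative abundance.
    """
    refs = sorted({ref for hits in read_hits for ref in hits})
    if not refs:
        return {}
    theta = {ref: 1.0 / len(refs) for ref in refs}  # uniform starting point
    for _ in range(n_iter):
        counts = dict.fromkeys(refs, 0.0)
        # E-step: distribute each read across its candidate references.
        for hits in read_hits:
            weights = {ref: theta[ref] * lik for ref, lik in hits.items()}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # read carries no usable information
            for ref, w in weights.items():
                counts[ref] += w / norm
        total = sum(counts.values())
        if total == 0.0:
            break
        # M-step: abundances are the normalised fractional counts.
        new_theta = {ref: counts[ref] / total for ref in refs}
        delta = max(abs(new_theta[ref] - theta[ref]) for ref in refs)
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    # Toy data: 8 reads hit only species_A, 2 reads hit A and B equally well.
    reads = [{"species_A": 1.0}] * 8 + [{"species_A": 1.0, "species_B": 1.0}] * 2
    print(em_abundance(reads))
```

In this toy case, splitting tied reads evenly would credit species_B with 10% of the abundance, whereas EM shifts the ambiguous reads toward species_A because no read supports species_B uniquely.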